ABSTRACT. The probabilistic graphs framework models the uncertainty inherent in real-world
domains by means of probabilistic edges whose value quantifies the likelihood of the edge
existence or the strength of the link it represents. The goal of this paper is to provide a
learning method to compute the most likely relationship between two nodes in a framework
based on probabilistic graphs. In particular, given a probabilistic graph, we adopt the
language-constrained reachability method to compute the probability of possible
interconnections that may exist between two nodes. Each of these connections may be viewed
as a feature, or a factor, between the two nodes, with the corresponding probability as its
weight. Each observed link is considered as a positive instance for its corresponding link
label. Given the training set of observed links, an L2-regularized Logistic Regression is
adopted to learn a model able to predict unobserved link labels. Experiments on a real-world
collaborative filtering problem show that the proposed approach achieves better results than
those obtained with classical methods.



1. INTRODUCTION 

Over the last few years the extension of graph structures with uncertainty has become
an important research topic [19, 26, 27, 12], leading to the probabilistic graph model.
Probabilistic graphs model uncertainty by means of probabilistic edges whose value quantifies
the likelihood of the edge existence or the strength of the link it represents. One of the
main issues in probabilistic graphs is how to compute the connectivity of the network. The 
network reliability problem fl4] is a generalization of the pairwise reachability, in which the 
goal is to determine the probability that all pairs of nodes are reachable from one another. 
Unlike a deterministic graph in which the reachability function is a binary value function 
indicating whether or not there is a path connecting two nodes, in the case of probabilistic 
graphs the function assumes probabilistic values. 

The concept of reachability in probabilistic graphs is used, along with its specializations,
as a tool to compute how likely two nodes in the graph are to be connected. Reachability
plays an important role in a wide range of applications, such as peer-to-peer networks
[3, 18], probabilistic routing problems [2, 10], road networks [11], and trust analysis in
social networks [22]. As adopted in these works, reachability is quite similar to the general
concept of link prediction [9], whose task may be formalized as follows. Given a networked
structure (V, E) made up of a set of data instances V and a set of observed links E among
some nodes in V, the task is to predict how likely it is that an unobserved link exists
between two nodes in the network.

The extension to probabilistic graphs adds an important ingredient that should be adequately
exploited. The key difference with respect to classical link prediction methods is that here
the observed connections between two nodes cannot always be considered true, and hence
methods exploiting probabilistic links are needed. Link prediction can be specialized into
link existence prediction, where one wants to assess whether two nodes should be connected,
and link classification, where one is interested in computing the most likely relationship
existing between two nodes.

The goal of this paper is to provide a learning method to compute the most likely
relationship between two nodes in probabilistic graphs. In particular, given a probabilistic
graph, we adopt the reachability tool to compute the probability of some possible
interconnections that may exist between two nodes. Each of these connections may be viewed
as a feature, or a factor, between the two nodes, with the corresponding probability as its
weight. Each observed labeled link is considered as a positive instance for its corresponding
link label. In particular, the link label corresponds to the value of the output variable
$y_i$, and the features between the two nodes, computed with the reachability tool,
correspond to the components of the corresponding vector $\mathbf{x}_i$. Given the training
set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, obtained from $n$ observed links, an
L2-regularized Logistic Regression is adopted to learn a model used to predict unobserved
link labels.

The application domain we chose is that of recommender systems, where the aim is to predict
the unknown rating between a user and an item. The experiments on a real-world dataset show
that the proposed approach achieves better results than those obtained with models induced
by Singular Value Decomposition (SVD) [20] on the user-item ratings matrix, one of the best
recent methods for this kind of task [15]. The paper is organized as follows: Section 2
presents the probabilistic graphs framework, Section 3 describes the proposed link
classification approach, Section 4 shows the experimental results, and Section 5 describes
related works.

2. PROBABILISTIC GRAPHS 

Let $G = (V, E)$ be a graph where $V$ is a collection of nodes and $E \subseteq V \times V$
is the set of edges, or relationships, between the nodes.

Definition 1 (Probabilistic graph). A probabilistic graph is a system
$G = (V, E, \Sigma, l_V, l_E, P_e)$, where $(V, E)$ is an undirected graph, $V$ is the set of
nodes, $E$ is the set of edges, $\Sigma$ is a set of labels, $l_V : V \to \Sigma$ is a
function assigning labels to nodes, $l_E : E \to \Sigma$ is a function assigning labels to
the edges, and $P_e : E \to [0, 1]$ is a function assigning existence probability values to
the edges.

The existence probability $P_e(a)$ of an edge $a = (u, v) \in E$ is the probability that the
edge $a$, between $u$ and $v$, exists in the graph. A particular case of probabilistic graph
is the discrete graph, where binary edges between nodes represent the presence or absence
of a relationship between them, i.e., the existence probability value on all observed edges
is 1.

The possible world semantics is usually used for probabilistic graphs. We can imagine
a probabilistic graph $G$ as a sampler of worlds, where each world is an instance of $G$. A
discrete graph $G'$ is sampled from $G$ according to the probability distribution $P_e$,
denoted as $G' \sqsubseteq G$, when each edge $a \in E$ is selected to be an edge of $G'$
with probability $P_e(a)$. Edges labeled with probabilities are treated as mutually
independent random variables indicating whether or not the corresponding edge belongs to a
discrete graph.

Assuming independence among edges, the probability distribution over discrete graphs
$G' = (V, E') \sqsubseteq G = (V, E)$ is given by

(1)  $P(G'|G) = \prod_{a \in E'} P_e(a) \prod_{a \in E \setminus E'} (1 - P_e(a))$.
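The possible world semantics and equation (1) can be illustrated with a minimal Python sketch. The dictionary `edge_probs` and the function names are hypothetical choices made for this illustration only; they are not part of the system described in the paper.

```python
import random

def sample_world(edge_probs, rng):
    """Draw one discrete graph G' by keeping each edge a with probability P_e(a)."""
    return {edge for edge, p in edge_probs.items() if rng.random() < p}

def world_probability(edge_probs, world):
    """P(G'|G) as in equation (1): P_e(a) for kept edges, 1 - P_e(a) for dropped ones."""
    prob = 1.0
    for edge, p in edge_probs.items():
        prob *= p if edge in world else 1.0 - p
    return prob

# Toy probabilistic graph with three independent edges (illustrative values).
edge_probs = {("u", "v"): 0.9, ("v", "w"): 0.5, ("u", "w"): 0.2}
world = sample_world(edge_probs, random.Random(0))
print(sorted(world), world_probability(edge_probs, world))
```

Because the edges are independent, the probabilities of all $2^{|E|}$ discrete graphs sum to one, which is also why exact enumeration quickly becomes impractical.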

Definition 2 (Simple path). Given a probabilistic graph $G$, a simple path of length $k$
from $u$ to $v$ in $G$ is a sequence of edges $p_{u,v} = \langle e_1, e_2, \ldots, e_k \rangle$,
such that $e_1 = (u, v_1)$, $e_k = (v_{k-1}, v)$, and $e_i = (v_{i-1}, v_i)$ for $1 < i < k$,
and all nodes in the path are distinct.

Given a probabilistic graph $G$ and a simple path $p_{s,t} = \langle e_1, \ldots, e_k \rangle$
in $G$ from node $s$ to node $t$, $l(p_{s,t}) = l_E(e_1) l_E(e_2) \cdots l_E(e_k)$ denotes
the concatenation of the labels of all edges in $p_{s,t}$. In order to give the following
definition, we recall that given a context free grammar (CFG) $C$, a string of terminals $s$
is derivable from $C$ iff $s \in L(C)$, where $L(C)$ is the language generated from $C$.

Definition 3 (Language constrained simple path). Given a probabilistic graph $G$ and a
context free grammar $C$, a language constrained simple path is a simple path $p$ such that
$l(p) \in L(C)$.

2.1. Inference. Given a probabilistic graph $G$, a main task is to compute the probability
that there exists a simple path between two nodes $u$ and $v$, that is, querying for the
probability that a randomly sampled discrete graph contains a simple path between $u$ and
$v$. More formally, the existence probability $P_e(q|G)$ of a simple path $q$ in a
probabilistic graph $G$ corresponds to the marginal of $P(G'|G)$ with respect to $q$:

(2)  $P_e(q|G) = \sum_{G' \sqsubseteq G} P(q|G') \cdot P(G'|G)$

where $P(q|G') = 1$ if the simple path $q$ exists in $G'$, and $P(q|G') = 0$ otherwise. In
other words, the existence probability of the simple path $q$ is the probability that the
simple path $q$ exists in a randomly sampled discrete graph.

Definition 4 (Language constrained simple path probability). Given a probabilistic graph
$G$ and a context free grammar $C$, the language constrained simple path probability of
$L(C)$ is

(3)  $P(L(C)|G) = \sum_{G' \sqsubseteq G} P(q|G', L(C)) \cdot P(G'|G)$

where $P(q|G', L(C)) = 1$ if there exists a simple path $q$ in $G'$ such that
$l(q) \in L(C)$, and $P(q|G', L(C)) = 0$ otherwise.

In particular, the previous definition gives us the possibility to compute the probability
of a set of simple path queries fulfilling the structure imposed by a context free grammar.
In this way we are interested in discrete graphs that contain at least one simple path
belonging to the language corresponding to the given grammar.

Computing the existence probability directly using (2) or (3) is intractable for large
graphs, since the number of discrete graphs to be checked is exponential in the number of
probabilistic edges. It involves checking the existence of the simple path in every discrete
graph and accumulating their probabilities. A natural way to overcome this intractability is
to approximate the existence probability using a Monte Carlo sampling approach [13]: 1) we
sample $n$ possible discrete graphs $G_1, G_2, \ldots, G_n$ from $G$ by sampling each edge
independently according to its edge probability; and 2) we check whether the simple path
exists in each sampled graph $G_i$. This process provides the following basic sampling
estimator for $P_e(q|G)$:

(4)  $P_e(q|G) \approx \widehat{P_e}(q|G) = \frac{\sum_{i=1}^{n} P(q|G_i)}{n}$

Note that it is not necessary to sample all edges to check whether the graph contains the
path. For instance, assume that an iterative depth first search procedure is used to check
the path existence. When a node is visited for the first time, we sample all its adjacent
edges and push them onto the stack used by the iterative procedure. We stop the procedure
either when the target node is reached or when the stack is empty (non existence).
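The sampling estimator (4), combined with lazy edge sampling during a depth first search, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: the graph is a hypothetical dictionary `{(u, v): (label, probability)}` with at most one edge per node pair, the language is given as a predicate `accepts` over the tuple of edge labels, and the search is bounded by a hypothetical `max_len` parameter.

```python
import random

def build_adjacency(edges):
    """edges: {(u, v): (label, probability)} for an undirected labelled graph."""
    adj = {}
    for (u, v), (label, p) in edges.items():
        adj.setdefault(u, []).append((v, label, (u, v), p))
        adj.setdefault(v, []).append((u, label, (u, v), p))
    return adj

def path_in_sampled_world(adj, src, dst, accepts, max_len, rng):
    """One trial: sample edges lazily and search for an accepted simple path."""
    world = {}  # edge key -> sampled presence, fixed for the whole trial

    def present(key, p):
        if key not in world:
            world[key] = rng.random() < p
        return world[key]

    stack = [(src, (src,), ())]  # (current node, visited nodes, labels so far)
    while stack:
        node, visited, labels = stack.pop()
        if node == dst and labels:
            if accepts(labels):
                return True
            continue  # a simple path cannot pass through its own end node
        if len(labels) == max_len:
            continue
        for nxt, label, key, p in adj.get(node, ()):
            if nxt not in visited and present(key, p):
                stack.append((nxt, visited + (nxt,), labels + (label,)))
    return False

def estimate(edges, src, dst, accepts, max_len=4, n=100, seed=0):
    """Fraction of n sampled worlds containing an accepted simple path, as in (4)."""
    rng = random.Random(seed)
    adj = build_adjacency(edges)
    hits = sum(path_in_sampled_world(adj, src, dst, accepts, max_len, rng)
               for _ in range(n))
    return hits / n
```

Each trial memoizes the sampled state of an edge, so an edge reached along two different partial paths is treated consistently within the same sampled world. Edges that are never reached are never sampled.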



3. LINK CLASSIFICATION 

Having defined the probabilistic graph, we can now adopt language constrained simple paths
in order to extract probabilistic features describing the link between two nodes in the
graph.

Given a probabilistic graph $G$, with the set $V$ of nodes and the set $E$ of edges, and
$Y \subseteq \Sigma$ a set of edge labels, we have a set of edges $D \subseteq E$ such that
for each element $e \in D$: $l_E(e) \in Y$. In particular, $D$ represents the set of
observed links whose label belongs to the set $Y$. Given the set of training links $D$ and
the set of labels $Y$, we want to learn a model able to correctly classify unobserved links.

3.1. Query based classification. One way to solve the classification task is to use a
language based classification approach. Given an unobserved edge $e_i = (u_i, v_i)$, in
order to predict its class $\hat{y}_i \in Y$ we can solve the following maximization
problem:

(5)  $\hat{y}_i = \arg\max_j P(q_j(u_i, v_i)|G)$,

where $q_j(u_i, v_i)$ is the unknown link with label $q_j \in Y$ between the nodes $u_i$ and
$v_i$. In particular, the maximization problem corresponds to computing the link prediction
for each $q_j \in Y$ and then choosing the label with maximum likelihood. This link
prediction task is based on querying the probability of some language constrained simple
path. In particular, predicting the probability of the label $q_j$ as $P(q_j(u_i, v_i)|G)$
in (5) corresponds to computing the probability $P(q|G)$ for a query path in a language
$L_j$, i.e., computing $P(L_j|G)$ as in (3):

(6)  $\hat{y}_i = \arg\max_j P(q_j(u_i, v_i)|G) \approx \arg\max_j P(L_j|G)$.
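As a small illustration of (6), the predicted label is simply the one whose language obtains the largest estimated probability. The function name and the example values below are hypothetical.

```python
def classify_by_query(label_probs):
    """Return the label q_j with the largest estimated P(L_j | G)."""
    return max(label_probs, key=label_probs.get)

# Hypothetical estimates of P(L_j | G) for one user-item pair.
estimates = {"r1": 0.05, "r2": 0.10, "r3": 0.30, "r4": 0.60, "r5": 0.25}
predicted = classify_by_query(estimates)
```

Note that the estimates need not sum to one, since each language is queried independently of the others.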

3.2. Feature based classification. The previous query based classification approach
considers the languages used to compute (6) as independent from each other, without
considering any correlation between them. A more interesting approach, which we investigate
in this paper, is to learn from the probabilistic graph a linear classification model
combining the predictions of each language constrained simple path.

In particular, given an edge $e$ and a set of $k$ languages
$\mathcal{L} = \{L_1, \ldots, L_k\}$, we can generate $k$ real valued features $x_i$ where
$x_i = P(L_i|G)$, $1 \le i \le k$. The original training set of observed links $D$ can hence
be transformed into the set of instances
$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,n}$, where $\mathbf{x}_i$ is a
$k$-component vector of features $x_{ij} \in [0, 1]$, and $y_i$ is the class label of the
corresponding example $\mathbf{x}_i$.

3.2.1. L2-regularized Logistic Regression. Linear classification represents one of the most
promising learning techniques for problems with a huge number of instances and features,
aiming at learning a weight vector $\mathbf{w}$ as a model. L2-regularized Logistic
Regression belongs to the class of linear classifiers and solves the following unconstrained
optimization problem:

(7)  $\min_{\mathbf{w}} f(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{w}}{2} + C \sum_{i=1}^{n} \log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i))$

where $\log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)) = \xi(\mathbf{w}; \mathbf{x}_i, y_i)$
denotes the specific loss function, $\frac{1}{2} \mathbf{w}^T \mathbf{w}$ is the
regularization term, and $C > 0$ is a penalty parameter. The decision function corresponds
to $\mathrm{sgn}(\mathbf{w}^T \mathbf{x}_i)$. In case of binary classification
$y_i \in \{-1, +1\}$, while for multi class problems the one vs the rest strategy can be
used.

Among the many methods for training logistic regression models, such as iterative scaling,
nonlinear conjugate gradient, and quasi Newton, an efficient and robust truncated Newton
method, called trust region Newton method, has been proposed [17].

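In order to find the parameters $\mathbf{w}$ minimizing $f(\mathbf{w})$ it is necessary to
set the derivative of $f(\mathbf{w})$ to zero. Denoting with
$\sigma(y_i \mathbf{w}^T \mathbf{x}_i) = (1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i))^{-1}$,
we have:

$\frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{w} + C \sum_{i=1}^{n} (\sigma(y_i \mathbf{w}^T \mathbf{x}_i) - 1) y_i \mathbf{x}_i = 0$.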
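The objective (7) and its gradient can be written down directly. The following pure Python sketch is meant only to make the formulas concrete; the function names are hypothetical, and a real experiment would rely on an optimized library rather than on this code.

```python
from math import exp, log

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def objective(w, X, y, C):
    """f(w) = w.w / 2 + C * sum_i log(1 + exp(-y_i w.x_i)), as in (7)."""
    loss = sum(log(1.0 + exp(-yi * dot(w, xi))) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * loss

def gradient(w, X, y, C):
    """df/dw = w + C * sum_i (sigma(y_i w.x_i) - 1) y_i x_i."""
    g = list(w)
    for xi, yi in zip(X, y):
        coeff = C * (sigmoid(yi * dot(w, xi)) - 1.0) * yi
        for j, xij in enumerate(xi):
            g[j] += coeff * xij
    return g
```

A quick way to gain confidence in such a derivation is to compare `gradient` with a finite difference approximation of `objective` on a few random points.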

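To solve the previous score equation, the Newton method requires the Hessian matrix:

$\frac{\partial^2 f(\mathbf{w})}{\partial \mathbf{w} \partial \mathbf{w}^T} = I + C X^T D X$,

where $X$ is the matrix of the $\mathbf{x}_i$ values, $D$ is a diagonal matrix of weights
with $i$th diagonal element
$\sigma(y_i \mathbf{w}^T \mathbf{x}_i)(1 - \sigma(y_i \mathbf{w}^T \mathbf{x}_i))$, and $I$
is the identity matrix. The Newton step is

$\mathbf{w}^{new} \leftarrow \mathbf{w}^{old} + \mathbf{s}^{old}$,

where $\mathbf{s}^{old}$ is the solution of the following linear system:

$\frac{\partial^2 f(\mathbf{w}^{old})}{\partial \mathbf{w} \partial \mathbf{w}^T} \mathbf{s}^{old} = -\frac{\partial f(\mathbf{w}^{old})}{\partial \mathbf{w}}$.

Instead of using this update rule, [17] propose a robust and efficient trust region Newton
method, using new rules for updating the trust region, whose corresponding algorithm has
been implemented in the LIBLINEAR system.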
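For intuition, the plain Newton update described above (not the trust region variant implemented in LIBLINEAR) can be sketched in pure Python for small dense problems. All names are hypothetical, and the linear system is solved with a textbook Gaussian elimination, which is only reasonable for a handful of features.

```python
from math import exp

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def solve(A, b):
    """Solve A s = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    s = [0.0] * n
    for r in range(n - 1, -1, -1):
        s[r] = (M[r][n] - sum(M[r][k] * s[k] for k in range(r + 1, n))) / M[r][r]
    return s

def newton_step(w, X, y, C):
    """One update w_new = w_old + s, where (I + C X^T D X) s = -gradient."""
    k = len(w)
    grad = list(w)
    hess = [[1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for xi, yi in zip(X, y):
        s = sigmoid(yi * dot(w, xi))
        for a in range(k):
            grad[a] += C * (s - 1.0) * yi * xi[a]
            for b in range(k):
                hess[a][b] += C * s * (1.0 - s) * xi[a] * xi[b]
    step = solve(hess, [-g for g in grad])
    return [wi + si for wi, si in zip(w, step)]
```

Since the Hessian is the identity plus a positive semidefinite matrix, the linear system always has a unique solution. The sketch takes full steps with no safeguard, which is exactly the aspect that the trust region method refines.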



4. EXPERIMENTAL EVALUATION 

The application domain we chose to validate the proposed approach is that of recommender
systems. In some domains both data and probabilistic relationships between them are
observable, while in other domains, like the one used in this paper, it is necessary to
elicit the uncertain relationships among the given evidence.



http://www.csie.ntu.edu.tw/~cjlin/liblinear

4.1. Probabilistic graph creation. A common approach to elicit probabilistic hidden
relationships between data is based on using similarity measures. To model the data with a
graph we can adopt different similarity measures for each type of node involved in the
relationships. For instance, we can define one similarity measure between homogeneous nodes
and another for heterogeneous nodes.

In a recommender system we have two types of entities: the users and the items, and
the only observed relationship corresponds to the ratings that a user has assigned to a set
of items. The goal is to predict the rating a user could assign to an item that the user has
never rated in the past. In the collaborative filtering approach there are two methods to
predict unknown ratings, exploiting users or items similarity. User-oriented methods
estimate unknown ratings based on previous ratings of similar users, while in item-oriented
approaches ratings are estimated using previous ratings given by the same user on similar
items.

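Let $U$ be a set of $n$ users and $I$ a set of $m$ items. A rating $r_{ui}$ indicates the
preference degree the user $u$ expressed for the item $i$, where high values mean stronger
preference. Let $S_u$ be the set of items rated by user $u$. A user-based approach predicts
an unobserved rating $\hat{r}_{ui}$ as follows:

(8)  $\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in U | i \in S_v} \sigma_u(u, v) \cdot (r_{vi} - \bar{r}_v)}{\sum_{v \in U | i \in S_v} |\sigma_u(u, v)|}$

where $\bar{r}_u$ represents the mean rating of user $u$, and $\sigma_u(u, v)$ stands for
the similarity between users $u$ and $v$, computed, for instance, using the Pearson
correlation:

$\sigma_u(u, v) = \frac{\sum_{a \in S_u \cap S_v} (r_{ua} - \bar{r}_u)(r_{va} - \bar{r}_v)}{\sqrt{\sum_{a \in S_u \cap S_v} (r_{ua} - \bar{r}_u)^2 \sum_{a \in S_u \cap S_v} (r_{va} - \bar{r}_v)^2}}$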



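On the other hand, item-based approaches predict the rating of a given item using the
following formula:

(9)  $\hat{r}_{ui} = \frac{\sum_{j \in S_u | j \neq i} \sigma_i(i, j) \cdot r_{uj}}{\sum_{j \in S_u | j \neq i} |\sigma_i(i, j)|}$

where $\sigma_i(i, j)$ is the similarity between the items $i$ and $j$.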
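The user-based neighbourhood prediction can be sketched in a few lines of Python. This is an illustrative reading of the classical formula under stated assumptions: ratings are stored in a hypothetical nested dictionary `ratings[user][item]`, user means are taken over all items the user rated, and a user with no usable neighbour simply receives their own mean rating.

```python
from math import sqrt

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(ru, rv):
    """Pearson correlation of two users over the items rated by both."""
    common = set(ru) & set(rv)
    if len(common) < 2:
        return 0.0
    mu, mv = mean(ru.values()), mean(rv.values())
    num = sum((ru[a] - mu) * (rv[a] - mv) for a in common)
    den = sqrt(sum((ru[a] - mu) ** 2 for a in common)
               * sum((rv[a] - mv) ** 2 for a in common))
    return num / den if den else 0.0

def predict_user_based(ratings, u, i):
    """Mean rating of u plus the similarity-weighted deviations of the users who rated i."""
    mean_u = mean(ratings[u].values())
    num = den = 0.0
    for v, rv in ratings.items():
        if v == u or i not in rv:
            continue
        s = pearson(ratings[u], rv)
        num += s * (rv[i] - mean(rv.values()))
        den += abs(s)
    return mean_u + num / den if den else mean_u
```

The item-based variant is analogous: it iterates over the items already rated by the user and weights their ratings by the item-item similarity.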

These neighbourhood approaches see each user as connected to other users, or consider
each item as related to other items, as in a network structure. In particular, they rely on
the direct connections among the entities involved in the domain. However, as recently
shown, techniques able to consider complex relationships among the entities, leveraging the
information already present in the network, yield an improvement in the processes of
querying and mining [25, 23, 24].

Given the set of observed ratings $\mathcal{K} = \{(u, i, r_{ui}) | r_{ui} \text{ is known}\}$,
we add a node with label user for each user in $\mathcal{K}$, and a node with label item for
each item in $\mathcal{K}$. The next step is to add the edges among the nodes. Each edge is
characterized by a label and a probability value, which should indicate the degree of
similarity between the two nodes. Two kinds of connections between nodes are added. For each
user $u$, we added an edge, labeled simU, between $u$ and the $k$ most similar users to $u$.
The similarity between two users $u$ and $v$ is computed adopting a weighted Pearson
correlation between the items rated by both $u$ and $v$. In particular, the probability of
the edge simU connecting two users $u$ and $v$ is computed as:

$P(\mathit{simU}(u, v)) = \sigma_u(u, v) \cdot w_u(u, v)$,

where $\sigma_u(u, v)$ is the Pearson correlation between the vectors of ratings
corresponding to the set of items rated by both user $u$ and user $v$, and
$w_u(u, v) = \frac{|S_u \cap S_v|}{|S_u \cup S_v|}$.
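The construction of the simU edges can be sketched as follows. The sketch makes two assumptions that should be read as such: the weight $w_u$ is taken to be the overlap ratio between the two users' sets of rated items, and negative correlations are clamped to zero so that the product can be used as a probability. The function names and the `ratings[user][item]` layout are hypothetical.

```python
from math import sqrt

def pearson_common(ru, rv):
    """Pearson correlation between the rating vectors restricted to co-rated items."""
    common = sorted(set(ru) & set(rv))
    if len(common) < 2:
        return 0.0
    a = [ru[i] for i in common]
    b = [rv[i] for i in common]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def sim_user_probability(ru, rv):
    """Edge probability: correlation times overlap weight (both choices are assumptions)."""
    union = set(ru) | set(rv)
    if not union:
        return 0.0
    overlap = len(set(ru) & set(rv)) / len(union)
    return max(0.0, pearson_common(ru, rv)) * overlap

def top_k_sim_edges(ratings, k):
    """Connect every user to its k most similar users with a simU edge."""
    edges = {}
    for u in ratings:
        scored = sorted(((sim_user_probability(ratings[u], ratings[v]), v)
                         for v in ratings if v != u), reverse=True)
        for p, v in scored[:k]:
            if p > 0.0:
                edges[tuple(sorted((u, v)))] = ("simU", p)
    return edges
```

The simI edges between items can be built in the same way after transposing the ratings, so that each item is described by the users who rated it.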

For each item $i$, we added an edge, with label simI, between $i$ and the $k$ most similar
items to $i$. In particular, the probability of the edge simI connecting the item $i$ to the
item $j$ has been computed as:

$P(\mathit{simI}(i, j)) = \sigma_i(i, j) \cdot w_i(i, j)$,

where $\sigma_i(i, j)$ is the Pearson correlation between the vectors corresponding to the
histogram of the set of ratings for the item $i$ and the item $j$, and
$w_i(i, j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$, where $S_i$ is the set of users rating
the item $i$.

Finally, edges with probability equal to 1, and with label $r_k$ between the user $u$ and
the item $i$, denoting that the user $u$ has rated the item $i$ with a score equal to $k$,
are added for each element $r_{ui}$ belonging to $\mathcal{K}$.

4.2. Feature construction. Let us assume that the values of $r_{ui}$ are discrete and belong
to a set $R$. Given the recommender probabilistic graph $G$, the query based classification
approach, as reported in Section 3.1, tries to solve the problem
$\hat{r}_{ui} = \arg\max_j P(r_j(u, i)|G)$, where $r_j(u, i)$ is the unknown link with label
$r_j$ between the user $u$ and the item $i$. This link prediction task is based on querying
the probability of some language constrained simple path. For instance, a user-based
collaborative filtering approach may be obtained by querying the probability of the paths,
starting from a user node and ending at an item node, belonging to the context free language
(CFL) $L_j = \{\mathit{simU}^1 r_j^1\}$. In particular, predicting the probability of the
rating $j$ as $P(r_j(u, i))$ corresponds to computing the probability $P(q|G)$ for a query
path in $L_j$, i.e., $\hat{r}_{ui} = \arg\max_j P(r_j(u, i)|G) \approx \arg\max_j P(L_j|G)$.

In the same way, an item-based approach could be obtained by computing the probability
of the paths belonging to the CFL $L_j = \{r_j^1 \mathit{simI}^1\}$. The proposed framework
also lets us construct more complex queries, such as those belonging to the CFL
$L_j = \{r_j \mathit{simI}^n : 1 \le n \le 2\}$, which explore the graph by considering not
only direct connections. Hybrid queries, such as those belonging to the CFL
$L_j = \{r_j \mathit{simI}^n : 1 \le n \le 2\} \cup \{\mathit{simU}^m r_j : 1 \le m \le 2\}$,
combine the user information with item information.

In order to use the feature based classification approach proposed in this paper we
can define a set of CFLs $\mathcal{L}$ and then compute, for each language
$L_j \in \mathcal{L}$, the probability $P(L_j|G)$ between a given user and all the items the
user rated. In particular, the set of observed ratings
$\mathcal{K} = \{(u, i, r_{ui}) | r_{ui} \text{ is known}\}$ is mapped to the training set
$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,n}$, where $x_{ij}$ is the probability
$P(L_j|G)$ between the nodes $u$ and $i$, and $y_i$ is equal to $r_{ui}$.

The proposed link classification method has been implemented in the Eagle system,
which provides a set of tools to deal with probabilistic graphs.

4.3. Dataset. In order to validate the proposed approach we used the MovieLens dataset,
made available by the GroupLens research group at University of Minnesota for the 2nd
International Workshop on Information Heterogeneity and Fusion in Recommender Systems.
We used the MovieLens 100K version consisting of 100000 ratings (ranging from 1 to 5)
regarding 943 users and 1682 movies, whose class distribution is reported in Table 1.
Each user has rated at least 20 movies and there is simple demographic information for the
users (such as age, gender, occupation, and zip code). The data was collected through the
MovieLens web site during the seven-month period from September 19th, 1997 through
April 22nd, 1998. In this paper we used the ratings only, without considering the
demographic information. The MovieLens 100K dataset is divided into 5 folds, where each fold
provides training data (consisting of 80000 ratings) and test data (with 20000 ratings).

http://ir.ii.uam.es/hetrec2011/datasets.html

Table 1. MovieLens dataset class distribution.

    r1      r2      r3      r4      r5
    6110    11370   27145   34174   21201

Table 2. Language constrained simple paths used for the MovieLens dataset.

    $L_1 = \{\mathit{simU}^1 r_k^1\}$
    $L_2 = \{r_k^1 \mathit{simI}^1\}$
    $L_3 = \{r_k^1 \mathit{simI}^n : 1 \le n \le 2\}$
    $L_4 = \{\mathit{simU}^n r_k^1 : 1 \le n \le 2\}$
    $L_5 = \{\mathit{simU}^n r_k^1 : 1 \le n \le 2\} \cup \{r_k^1 \mathit{simI}^n : 1 \le n \le 2\}$
    $L_6 = \{r_k^1 \mathit{simI}^n : 1 \le n \le 3\}$
    $L_7 = \{\mathit{simU}^n r_k^1 : 1 \le n \le 3\}$
    $L_8 = \{\mathit{simU}^n r_k^1 : 1 \le n \le 3\} \cup \{r_k^1 \mathit{simI}^n : 1 \le n \le 3\}$
    $L_9 = \{\mathit{simU}^n r_k^1 : 1 \le n \le 4\} \cup \{r_k^1 \mathit{simI}^n : 1 \le n \le 4\}$
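The languages of Table 2 are finite unions of bounded repetitions, so they are regular and can be checked with ordinary regular expressions. The following sketch assumes that a path is represented by its sequence of edge labels, written as the strings "simU", "simI" and "r1" to "r5"; this encoding and the helper name `language` are illustrative assumptions.

```python
import re

def language(kind, rating, n_max):
    """Matcher for the bounded languages of Table 2, for the rating label r<rating>.

    kind is "user" (simU^n r_k), "item" (r_k simI^n) or "hybrid" (their union),
    with 1 <= n <= n_max.
    """
    r = f"r{rating}"
    user = rf"(?:simU ){{1,{n_max}}}{r}"
    item = rf"{r}(?: simI){{1,{n_max}}}"
    pattern = {"user": user, "item": item, "hybrid": f"(?:{user}|{item})"}[kind]
    regex = re.compile(pattern)
    return lambda labels: regex.fullmatch(" ".join(labels)) is not None

# L_1, L_4 and L_5 of Table 2 for the rating value 5.
L1 = language("user", 5, 1)
L4 = language("user", 5, 2)
L5 = language("hybrid", 5, 2)
print(L4(("simU", "simU", "r5")), L1(("simU", "simU", "r5")))
```

A matcher built this way can serve as the acceptance test inside a sampling procedure, yielding one estimated probability, and therefore one feature, per language.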

For each training/testing fold the validation procedure followed these steps:

(1) creating the probabilistic graph from the training ratings data set as reported in
Section 4.1;

(2) defining a set $\mathcal{L}$ of context free languages to be used to construct a
specific set of features as described in Section 4.2;

(3) learning the L2-regularized Logistic Regression model; and

(4) testing the ratings reported in the testing data set $T$ by computing, for each pair
$(u, i) \in T$, the predicted rating with the learned classification model and comparing
the result with the true rating reported in $T$.

For the graph construction, edges are added using the procedure presented in
Section 4.1, where we set the parameter n = 30, indicating that a user or a film is
connected, respectively, to the 30 most similar users or films. The value of each feature
has been obtained with the Monte Carlo inference procedure by sampling 100 discrete graphs.

In order to construct the set of features, we propose to query the paths belonging to
the set of languages $\mathcal{L}$ reported in Table 2. The first language $L_1$
corresponds to a user-based approach, while the second language $L_2$ corresponds to an
item-based approach. Then, we extend the basic languages $L_1$ and $L_2$ in order to
construct features that consider a neighbourhood with many nested levels. In particular,
instead of considering the direct neighbours only, we inspect the probabilistic graph
following a path with a maximum length of two ($L_3$ and $L_4$) and three edges ($L_6$ and
$L_7$). Finally, we constructed hybrid features by combining both the user-based and
item-based methods and the large neighbourhood explored with paths whose length is greater
than one ($L_5$, $L_8$ and $L_9$). We defined two sets of features:
$\mathcal{F}_1 = \{L_1, L_2, L_3, L_4, L_5\}$, based on simple languages, and
$\mathcal{F}_2 = \{L_3, L_4, L_5, L_6, L_7, L_8, L_9\}$, exploiting more complex queries.
In order to learn the classification model as reported in Section 3.2.1 we used the
L2-regularized Logistic Regression implementation included in the LIBLINEAR system.

TABLE 3. MAE^M values obtained with Eagle and SVD on MovieLens dataset.

    Fold      SVD            Eagle@F1       Eagle@F2       U      C
    1         0.9021         0.8424         0.8255
    2         0.9034         0.8332         0.8279
    3         0.9111         0.8464         0.8362
    4         0.9081         0.8527         0.8372
    5         0.9159         0.8596         0.8502
    Mean      0.908±0.006    0.847±0.01     0.835±0.01     1.6    1.51
    p-value                  2.3E-6         5.09E-7


Given a set $T$ of testing instances, the accuracy of the proposed framework has been
evaluated according to the macroaveraged mean absolute error ($MAE^M$) [1]:

$MAE^M = \frac{1}{|R|} \sum_{j \in R} \frac{1}{|T_j|} \sum_{x_i \in T_j} |y_i - \hat{y}_i|$

where $T_j \subseteq T$ denotes the set of test ratings whose true class is $j$.
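The macroaveraged error first averages the absolute error within each true class and then averages these per-class values, so that rare ratings weigh as much as frequent ones. A minimal sketch follows; the function name is hypothetical, and only the classes that actually occur among the true ratings are averaged.

```python
from collections import defaultdict

def macro_mae(y_true, y_pred):
    """Mean over true classes of the mean absolute error within each class."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(abs(t - p))
    return sum(sum(e) / len(e) for e in errors.values()) / len(errors)

# Two test ratings of class 1 and one of class 5.
score = macro_mae([1, 1, 5], [1, 3, 4])
```

In this toy example the class 1 error is (0 + 2) / 2 = 1 and the class 5 error is 1, so the macroaverage is 1, whereas the ordinary mean absolute error over the three instances would also be 1 only by coincidence of the chosen numbers.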

4.4. Results. Table 3 shows the results obtained with the proposed approach, implemented
in the Eagle system, compared to those obtained with the RecSys SVD based implementation.
The row labeled Mean reports the mean value of the $MAE^M$ averaged over the five folds
for the SVD approach and for the proposed classification method as implemented in the
Eagle system. As we can see, the error achieved by our method is lower than that obtained
by the SVD method. The results improve further when we use the set $\mathcal{F}_2$ of
features. The difference between the results obtained with the two methods is statistically
significant, with a p-value for the t-test equal to 0.0000023 when using the set
$\mathcal{F}_1$ of features, and equal to 0.000000509 for the other set of features. The
last two columns report the results of two baseline methods. The second last column (U)
reports the results obtained with a system that predicts a rating adopting a uniform
distribution, while the last column (C) reports the results of a system that uses a
categorical distribution that predicts the value $k$ of a rating with probability
$p_k = |D_k|/N$, where $|D_k|$ is the number of ratings belonging to the dataset having
value $k$, and $N$ is the total number of ratings.
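The two baseline values can be checked in expectation with a short computation. The sketch below uses the class counts of the whole dataset as reported in Table 1 (an approximation, since the reported values refer to the test folds) and computes the expected macroaveraged error of a predictor that draws each rating at random from a given distribution. The names are hypothetical.

```python
# Class counts of the MovieLens 100K ratings, from Table 1.
counts = {1: 6110, 2: 11370, 3: 27145, 4: 34174, 5: 21201}

def expected_macro_mae(pred_dist, classes):
    """Expected macroaveraged MAE of a predictor sampling ratings from pred_dist."""
    per_class = [sum(p * abs(j - k) for k, p in pred_dist.items()) for j in classes]
    return sum(per_class) / len(per_class)

total = sum(counts.values())
uniform = {k: 1 / len(counts) for k in counts}
categorical = {k: c / total for k, c in counts.items()}

u_baseline = expected_macro_mae(uniform, counts)      # about 1.60
c_baseline = expected_macro_mae(categorical, counts)  # about 1.51
print(round(u_baseline, 2), round(c_baseline, 2))
```

Both expectations agree, to two decimals, with the U and C columns of Table 3.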

In Table 4 we can see the errors committed by each method on each rating class. The
rows for the methods U and C report the mean of the $MAE^M$ value for each fold using
a system adopting a uniform or a categorical distribution. The dataset is not balanced,
as reported in Table 1. As we can see, both the SVD and the Eagle system adhere
more closely to the categorical distribution, showing that they are able to recognize the
unbalanced distribution of the dataset.

5. RELATED WORKS 

In [19] the authors provide a list of alternative shortest-path distance measures for
probabilistic graphs in order to discover the k closest nodes to a given node. Their work is
related to the stochastic shortest path problem, which deals with computing the probability
density function of the shortest path length for a pair of nodes [8]. They provide a
scalable solution for the k-NN problem by using a direct sampling approach that
approximates the shortest-path probability between two nodes by sampling n possible
discrete graphs from the probabilistic graph and then computing the shortest path distance
in each sampled discrete graph. In [6], the problem of finding a shortest path on a
probabilistic graph is addressed by transforming each edge probability to its expected
value and then running the Dijkstra algorithm.

TABLE 4. MAE^M values for each class obtained with Eagle and SVD on MovieLens dataset.

    Fold   Method      r1     r2     r3     r4     r5
    1      SVD         1.58   1.04   0.56   0.44   0.86
    1      Eagle@F1    1.11   0.76   0.69   0.61   1.02
    1      Eagle@F2    1.03   0.75   0.71   0.63   0.99
    2      SVD         1.60   1.04   0.55   0.43   0.87
    2      Eagle@F1    1.11   0.77   0.67   0.58   1.02
    2      Eagle@F2    1.05   0.77   0.68   0.60   1.00
    3      SVD         1.65   0.99   0.55

https://github.com/ocelma/python-recsys

The authors in [13] investigated a more general and informative distance-constraint
reachability (DCR) query problem: given two nodes s and t in a probabilistic graph G,
the aim is to compute the probability that the distance from s to t is less than or equal
to d. They show that the simple reachability problem without constraint becomes a special
case of the distance-constraint reachability, considering the case where the threshold d is
larger than the length of the longest path. In order to solve the DCR problem they provide
an estimator based on a direct sampling approach and two new estimators based on unequal
probability sampling and recursive sampling. Furthermore, they proposed a divide and
conquer exact algorithm that computes the exact s-t DCR by recursively partitioning all the
possible discrete graphs from the probabilistic graph into groups, so that the reachability
of these groups can be computed easily.

The need to model the uncertainty inherent in the data has increased the attention on
probabilistic databases. In this framework exact approaches are infeasible for large
databases [5], and hence research has focused on computing approximate answers [14]. An
important issue in probabilistic databases regards the efficient evaluation of top-k
queries. A traditional top-k query returns the k objects with the maximum scores based on
some scoring function. In the uncertain world the scoring function becomes a probabilistic
function. [21] formalized the problem and [16] proposed a unified approach to ranking in
probabilistic databases.

In this paper we adopt the probabilistic graphs framework to deal with uncertain problems,
exploiting both the probabilistic values of the edges and the edge labels denoting the type
of relationship between two nodes. Our work exploits the reachability tool using a direct
sampling approach and considers as a constraint, instead of the number of visited edges or
the likelihood of the path, the concatenation of the labels of the visited edges going from
one node to another. We can consider the approach proposed in this paper as a
generalization of the DCR problem, since the DCR case is recovered by considering
homogeneous labels and a constraint on the length of the paths.

6. CONCLUSIONS 

In this paper we presented the Eagle system, which integrates a framework based on
probabilistic graphs able to deal with link prediction problems by adopting reachability.
We proposed a learning method to compute the most likely relationship between two nodes
in probabilistic graphs. In particular, we used a probabilistic graph in order to represent
uncertain data and relationships, and we adopted the reachability tool to compute the
probability of unknown interconnections between two nodes not directly connected. Each of
these connections may be viewed as a probabilistic feature, and we can describe each
observed link in the graph as a feature vector. Given the training set of observed links,
an L2-regularized Logistic Regression has been adopted to learn a model able to predict the
label of unobserved links. The application domain we chose is that of recommender systems.
The experimental evaluation showed that the proposed approach achieves better results than
those obtained with models induced by Singular Value Decomposition on the user-item ratings
matrix, one of the best recent methods for this kind of problem.